(57) A method and apparatus for extracting information
from web servers connected to the Internet 30 in
such a way that the extracted information can be re- 
used by a server 20 to generate a table of information 
extracted from a plurality of different web-sites in order 
to present a summary of the information to a user ter- 
minal 10 connected to the server 20. The information is 
extracted using a method which extracts the information 
on the basis of where the data is located within the web- 
site as viewed by a human viewer such that the method 
is robust to changes in the underlying structure of the 
HTML code which do not greatly affect the look of the 
web-page from a human user's perspective. 
Description

Technical Field 

[0001] The present invention relates to a method and
apparatus for processing data presented in a semi- 
structured format, such as, for example, data found in 
internet web-sites. In particular, the present invention 
relates to a method and apparatus for generating com- 
puter programs adapted to automatically process web- 
sites and to attempt to extract certain desired data from 
the web-sites, and also to the computer programs gen- 
erated in this way. 

Background to the Invention 



[0002] Computer programs which are intended to au- 
tomatically process web-sites and return certain desired 
data items for further processing or for review by a user 
are known. Such programs are usually referred to as 
"wrappers". 

[0003] A paper by A. Sahuguet and F. Azavant entitled
"Building light-weight wrappers for legacy web data
sources using w4f" and published in The Proceedings
of International Conference on Very Large Databases
(VLDB), 1999 describes the "W4F (World-Wide Web
Wrapper Factory) toolkit" which is a suite (ie a toolkit) of
computer programs including a graphical user interface
which can be used to generate Web wrappers. Each
wrapper generated using the W4F toolkit is itself a Java
program (in byte code, a so-called Java class file) and
consists of three independent layers: a retrieval layer,
an extraction layer and a mapping layer. The function of
the retrieval layer is to locate and fetch the Hyper-Text
Markup Language (HTML) content to be "wrapped"; this
is done using a URL which constitutes the input data
passed to the wrapper. The extraction layer extracts de- 
sired data from the HTML content according to various 
rules specified by the wrapper author. The mapping lay- 
er specifies how the extracted data is exported (for view- 
ing by a user or for further processing by another com- 
puter program). The rules, which are specified by the 
wrapper author, are expressed using a high level com- 
puter language referred to as HEL (HTML Extraction 
Language) which permits a declarative specification of 
information to be extracted from the HTML content to 
be wrapped. 

[0004] HEL is a language which relies on a Document 
Object Model (DOM) representation of a piece of HTML 
content such as a web-site page. The language permits 
a user to specify the location of data to be extracted by 
reference to features explicitly coded into the HTML 
content (ie by reference to HTML tags). For example, 
HEL would permit a wrapper to extract the data located 
in the first and second columns of a table located at a 
specified position within the DOM tree of a web-page 
written in HTML.

[0005] Using HEL means that it is relatively straightforward
for a user to generate wrapping rules for a wrapper
to locate and extract the desired information from a
web-site. Thus the W4F tool-kit includes a "wizard" with
which a user may point to an item of information on a
web-site which he/she would like the final wrapper, once
made, to extract, and the wizard automatically generates
the correct HEL commands for extracting that piece of
information. The HEL commands can then be compiled
into a Java class and the formation of the wrapper is complete.

However, wrappers generated in this way are not robust
to even relatively small changes to the structure of a
web-site caused, for example, by a web-site author updating,
improving, enhancing or otherwise modifying his
or her web-site. For example, because a wrapper generated
using the W4F toolkit relies heavily on the nature
and position of HTML tags, if a web-author modifies a
web-site to present the same information no longer in a
table format but instead in a bullet point format, a new
wrapper will have to be produced to obtain the same
information from the modified web-site.
Similarly, changing a heading from simple text to a hyperlink,
altering the ordering of certain text items to
change the emphasis of certain pieces of information,
etc. may all require a new wrapper to be created when
using the W4F toolkit.



Summary of the Invention 

[0006] According to a first aspect of the present invention
there is provided a method of extracting information
from a specified ordered group of data containing semi- 
structured data, the method comprising the steps of: 

extracting data from within the specified ordered 
group of data from a location as specified by a plu- 
rality of rules; wherein 

the rules specify the location, within the specified 
ordered group of data, of data to be extracted, in 
terms of visual characteristics of the data to be ex- 
tracted as they appear when the specified ordered 
group of data is displayed using a computer appli- 
cation capable of displaying the semi-structured da- 
ta in a manner determined by formatting data in- 
cluded in the specified ordered group of data. 

[0007] The term semi-structured data is used to refer
to data which, when displayed in the manner in which
the author of the data intended for it to be displayed (eg
in the case of HTML content, when it is displayed using
a web-browser adapted for displaying HTML content, or,
in the case of formatted text, when displayed using a
word-processing application capable of correctly interpreting
the formatting instructions contained in the text
file), has a logical structure to it which, although not adhering
to a strict formula specifying exactly where particular
items of data should be located, nonetheless has
a certain level of structure for permitting a viewer to
intuitively glean the facts of interest to the viewer in a relatively
short period of time. For example, web-site pages
designed for the sale of a particular model of mobile
telephone will generally structure the layout of the text 
such that the most "important" information (eg the name 
of the model, the cost of the model, the networks on 
which the mobile can be operated and other specifica- 
tions of typical interest to a potential user) will usually 
be located in large text towards the top of the page, 
whilst less "important" information, such as standard 
terms and conditions of sale, exclusions of warranty, etc. 
will tend to be located in smaller text towards the bottom 
of the page. Similarly, a Curriculum Vitae is normally 
written in a semi-structured manner with the most "important"
information (eg a listing of qualifications and
skills most relevant to a particular position for which the
author of the Curriculum Vitae is applying) located near
the top of the document possibly with some sort of formatting
emphasis (eg underlining, italics or bold font),
whilst less "important" information is located near the 
bottom of the document. 

[0008] Preferably, the ordered group of data is a data 
file storing, for example, HTML content, formatted text 
data or a combination of formatted text data and non- 
text data (such as pictures). In the case where the or- 
dered group of data is a data file storing HTML content, 
the rules specify the location of the data to be extracted 
in terms of visual characteristics of the data as they ap- 
pear when the data file is displayed using a web brows- 
er, and the formatting data included in the data file is in 
the form of HTML tags. 

[0009] Preferably, the rules are formed from a combi- 
nation of one or more specifications of the following vis- 
ual characteristics of the data to be extracted as they 
appear, relative to the appearance of the specified data 
file as a whole, when the specified file is displayed within 
the browser: the fractional area within the specified file, 
as it appears within the displaying computer application, 
of the data; the relative size of the font of the data if it is 
in the form of text; the colour of the data; the proximity 
of the data to a specified keyword; and the location of 
the data relative to a specified keyword. 
[0010] The method may be carried out using compu- 
ter processing means capable of reading the specified 
file and writing extracted data to a suitable memory. In 
such a case, the method may include the step of input- 
ting the rules to the computer processing means. Pref- 
erably, in such a case, the rules are inputted using a 
graphical user interface which prevents a user from 
specifying rules which depend upon the structure of 
HTML code within the file instead of the visual charac- 
teristics of the data as they appear when the specified 
file is displayed. 

[0011] According to a second aspect of the present 
invention, there is provided a method of generating a 
set of rules for extracting one or more sets of data from 
a specified ordered group of data containing semi-struc- 
tured data, the method comprising the steps of: 



controlling the display of a list of names of visual
characteristics suitable for describing the location
of a set of data to be extracted from a specified ordered
group of data containing semi-structured data; and

generating one or more rules involving a combination
of one or more of the displayed characteristics
together with one or more specified values of the
characteristics, in response to input signals generated
by a user.

[0012] Preferably, the method further includes the 
step of converting the rules into computer implementable
instructions for processing an input file containing
HTML content and outputting the data which would be
displayed at the location specified by the rules if the in- 
put file were displayed in a browser for displaying HTML 
content. Preferably, the computer implementable in- 
structions are in the form of a Java class file. 
[0013] According to a third aspect of the present in- 
vention, there is provided a device for generating a set 
of rules for extracting one or more sets of data from a 
specified ordered group of data containing semi-struc- 
tured data comprising: 

display control means for controlling the display of 
a list of names of visual characteristics suitable for 
describing the location of a set of data to be extract- 
ed from a specified ordered group of data contain- 
ing semi-structured data; and 
rule generation means for generating one or more 
rules comprising a combination of one or more of 
the displayed characteristics together with one or 
more specified values of the characteristics; wherein
the specified values of the characteristics are set
in response to input signals generated by a user via 
a user interface. 

[0014] Preferably, the device further includes conver- 
sion means for converting the generated rules into com- 
puter implementable instructions for processing an input 
file containing HTML content and outputting the data 
which would be displayed at the location specified by 
the rules if the input file were displayed in a browser for 
displaying HTML content. Preferably, the computer im- 
plementable instructions are in the form of a Java class 
file. 

Brief description of the drawings 

[0015] In order that the present invention may be bet- 
ter understood, embodiments thereof will now be de- 
scribed by way of example only with reference to the 
accompanying drawings in which: 

Figure 1 is a block diagram of a computer network 
including a server computer running a web-wrapper 
program for extracting information from web sites 
hosted on other computers connected to the Internet;

Figure 2 is a flow chart illustrating the steps per- 
formed by the web-wrapper program running on the 
server computer illustrated in Figure 1, including a
subroutine for processing a particular page of a par- 
ticular web-site; 

Figure 3 is a flow chart illustrating the steps per- 
formed by the subroutine illustrated in Figure 2 for 
processing a particular page of a particular web- 
site; 

Figure 4 is a screen dump of an HTML file suitable 
for processing by the web-wrapper program illus- 
trated in Figures 2 and 3 as viewed by a web brows- 
er; 

Figure 5 is a flow chart illustrating the steps per- 
formed in generating a set of wrapping rules for use 
in forming a web-wrapper, using a web-wrapper de- 
velopment tool application; 

Figure 6 is a screen dump of a wordprocessor file 
suitable for processing by the web-wrapper pro- 
gram illustrated in Figures 2 and 3 as viewed using 
a compatible wordprocessor application; and 

Figure 7 is a block diagram of a computer network 
comprising a user terminal computer connected to 
the Internet and running a web-wrapper program for 
extracting information from a web-site hosted on a 
server computer connected to the Internet. 



Description of the First Embodiment 



[0016] Figure 1 shows a computer network comprising
a user terminal computer 10, operated by a user (not
shown), which is connected, via a data connection 11,
to a server computer 20. The server computer 20 is
owned by an Internet Service Provider used by the user
to access the Internet 30 to which the server computer
20 has a fast data connection 21. One of the web-sites
hosted by the server computer 20 in this embodiment is 
a mobile telephone comparison web-site for obtaining 
the latest prices available for a number of different mo- 
bile telephone models from a number of different web- 
sites hosted on a variety of different server computers 
connected to the Internet. The mobile telephone com- 
parison web-site collects this information together into 
a comparison table which may be viewed by a user with- 
in a single web page, forming part of the mobile tele- 
phone comparison web-site, for ease of comparison. 
[0017] In order to obtain the information to be presented
in the comparison table, the server computer 20 runs
a computer program which periodically accesses each
of a number of different web-sites which each include
an offered price for a particular model of mobile telephone.
The Hyper Text Mark-up Language (HTML) content
comprising each such web-site is processed by the
program and the current value of the offered price (together
with other useful information included in the table)
is extracted from the HTML content and used to update
the comparison table.

[0018] The steps performed by the computer program
are now described in greater detail with reference to Figures
2 and 3. Upon commencement of the program, the
program imports at step S10 a list of HTML files (each
of which is specified by reference to a URL or an expression
which evaluates to a URL) for processing by
the program. Flow then proceeds to step S20 in which
the first of the HTML files specified in step S10 is selected
and downloaded for processing.
[0019] Flow then passes to subroutine S30 in which
the HTML file downloaded in the previous step is processed
to attempt to retrieve the desired data from the
file.

[0020] In the present embodiment, the desired data
is: i) the name of the model of mobile telephone; ii) the
name of the network operator associated with the telephone
for the specified price; iii) the original or high
street price associated with the mobile telephone; iv) the
current or offered price advertised on the web-site being
processed; v) the indicated saving (ie the difference between
the original or high street price and the current or
offered price advertised by the web-site); and vi) the current
availability of the telephone (ie whether or not the
item is in stock).

[0021] Once the subroutine has finished, the successfully
extracted data are stored in appropriate fields (ie
named memory locations from which the stored data
can be readily accessed), and flow passes to step S40
at which it is determined whether or not all of the HTML
files specified in step S10 have now been processed. If
not, flow passes to step S50 in which the next HTML file
specified in step S10 is downloaded for processing and
flow returns to step S30 for processing the newly downloaded
HTML file. If it is determined at step S40 that all
of the HTML files specified in step S10 have now been
processed, flow passes to step S60.
[0022] In step S60, the data which has been newly
stored in a set of appropriate fields is made available for
other programs to access. In the present embodiment,
access to the fields in question is prevented while the
program is running (ie during steps S10 to S50) to prevent
any external programs from attempting to read the
fields during a write operation when the data stored will
be invalid. Such procedures are well understood in the
field of object oriented programming and will not be described
here at greater length.
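
The locking described in paragraph [0022] can be illustrated with a minimal sketch. It assumes a single in-process lock and uses Python for brevity; the class name OutputFields and its methods are hypothetical and do not appear in the embodiment, which does not say how access is actually prevented.

```python
import threading
from contextlib import contextmanager


class OutputFields:
    """Named fields that cannot be read while an update run is in progress."""

    def __init__(self):
        self._lock = threading.Lock()
        self._fields = {}

    @contextmanager
    def updating(self):
        # Hold the lock for the whole update run (steps S10 to S50),
        # so that no reader can observe a half-written set of fields.
        with self._lock:
            yield self._fields

    def read(self, name):
        # A reader blocks until any update in progress has finished,
        # which corresponds to the fields being released at step S60.
        with self._lock:
            return self._fields.get(name)
```

A writer would wrap one complete run in `with store.updating() as fields:` and external readers would call `store.read(...)`.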

[0023] Upon completion of step S60, flow passes to
step S70 at which it is tested if sufficient time has
elapsed for the fields to be re-updated. In the present
embodiment, the program is run every hour to ensure
that the data displayed in the comparison table is never
more than one hour out of date. Since mobile telephone
prices rarely change more frequently than every few
days, this is considered to be amply frequent.
[0024] Step S80 combines with step S70 to form a 
continuous waiting loop. If it is not detected to be time 
for re-running the program, it is checked in step S80 
whether or not the program has received an instruction 
to quit; if so, the program ends, otherwise flow repeat- 
edly returns to step S70 until it is time for the program 
to be re-run at which point flow passes back to step S10
and the entire program is re-run. 
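
The control flow of steps S10 to S80 can be sketched as follows. This is an illustrative outline only: the function name run_wrapper and the callables download, process_page, publish and should_quit are hypothetical stand-ins for step S20/S50, subroutine S30, step S60 and the quit test of step S80, and the list of URLs is passed in once rather than re-imported at step S10 on every run.

```python
import time


def run_wrapper(urls, download, process_page, publish, should_quit,
                interval_s=3600.0, poll_s=1.0):
    """Repeat the extraction run until should_quit() reports True."""
    while True:
        results = {}
        for url in urls:                                # S20, S40, S50
            results[url] = process_page(download(url))  # subroutine S30
        publish(results)                                # S60: release the fields
        deadline = time.monotonic() + interval_s
        while time.monotonic() < deadline:              # S70: time to re-run?
            if should_quit():                           # S80: instruction to quit?
                return
            time.sleep(poll_s)
```

The default interval of 3600 seconds mirrors the hourly re-run of the present embodiment; the polling period is an assumption.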
[0025] Referring now to Figure 3, upon commence- 
ment of subroutine S30, flow passes to step S100 in 
which the wrapping rules for use in processing the re- 
cently downloaded HTML file (downloaded either in step 
S20 or S50 as appropriate) are imported from a store of 
wrapping rules. In the present embodiment, a single set of
wrapping rules is used for processing each of the HTML
files specified, by reference to a URL, in step S10.
Alternative embodiments could use different sets
of wrapping rules for different URLs. It is nevertheless
important to note that the wrapping rules employed in the
present embodiment are robust to changes in the con- 
tent of the HTML files caused by updating of the HTML 
file to reflect, for example, changes in the offered price 
of a mobile telephone. Thus even if different wrapping 
rules are used for different URLs, the wrapping rules 
should not require updating every time the HTML con- 
tent located at the specified URL is updated. In the 
present embodiment, the wrapping rules are sufficiently 
robust that only a single set of wrapping rules is required 
for all of the specified URLs and despite minor modifi- 
cations occurring at those URLs in order to periodically 
update the HTML content stored at those URLs. 
[0026] In the present embodiment, there are eight
wrapping rules each of which is specified by setting a
value for each of seven different rule fields. The seven
different rule fields are: i) attribute name; ii) attribute location;
iii) attribute content; iv) attribute range; v) data extraction
method; vi) data content; and vii) data range. The
eight wrapping rules are:

Wrapping rule 1:

attribute name: "Brand_of_Mobilephone"
attribute location: "heuristic matching"
attribute content: "biggest size"
attribute range: "top"
data extraction method: "direct"
data content: null
data range: 0

Wrapping rule 2:

attribute name: "Service_Provider"
attribute location: "data matching"
attribute content: "btcellnet", "orange", "one2one"
attribute range: "top"
data extraction method: "matched"
data content: null
data range: 0

Wrapping rule 3:

attribute name: "Original_Price"
attribute location: "domain matching"
attribute content: "rrp"
attribute range: "whole"
data extraction method: "indirect"
data content: "next currency"
data range: 3

Wrapping rule 4:

attribute name: "Current_Price"
attribute location: "domain matching"
attribute content: "only"
attribute range: "whole"
data extraction method: "indirect"
data content: "next currency"
data range: 3

Wrapping rule 5:

attribute name: "Comparable_Price"
attribute location: "domain matching"
attribute content: "save"
attribute range: "whole"
data extraction method: "indirect"
data content: "next currency"
data range: 3

Wrapping rule 6:

attribute name: "Availability"
attribute location: "domain matching"
attribute content: "in stock"
attribute range: "whole"
data extraction method: "boolean"
data content: null
data range: 0

Wrapping rule 7:

attribute name: "Sale_ID"
attribute location: null
attribute content: null
attribute range: null
data extraction method: "generate"
data content: "Brand_of_Mobilephone", "Service_Provider"
data range: null

Wrapping rule 8:

attribute name: "URL"
attribute location: null
attribute content: null
attribute range: null
data extraction method: "generate"
data content: "URL"
data range: null
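
One way to hold the seven rule fields in a program is a small record type. The sketch below is an assumption about representation only: the class name WrappingRule, the snake_case field names and the use of empty tuples and None for null are illustrative choices, and only three of the eight rules are transcribed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WrappingRule:
    """One wrapping rule; None or an empty tuple stands for null."""
    attribute_name: str
    attribute_location: Optional[str] = None
    attribute_content: Tuple[str, ...] = ()
    attribute_range: Optional[str] = None
    data_extraction_method: str = "direct"
    data_content: Tuple[str, ...] = ()
    data_range: Optional[int] = None


# Wrapping rules 1, 3 and 7 from the list above.
RULES = (
    WrappingRule("Brand_of_Mobilephone", "heuristic matching",
                 ("biggest size",), "top", "direct", (), 0),
    WrappingRule("Original_Price", "domain matching",
                 ("rrp",), "whole", "indirect", ("next currency",), 3),
    WrappingRule("Sale_ID", data_extraction_method="generate",
                 data_content=("Brand_of_Mobilephone", "Service_Provider")),
)
```
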
[0027] Upon completion of step S100, flow passes to
step S110 in which the first of the eight wrapping rules 
imported in step S100 is selected (ie Wrapping rule 1 is 
selected). 

[0028] Having selected a wrapping rule, flow passes 
to step S120 in which the fraction of the web-site page 
corresponding to the HTML file being processed in 
which the data to be extracted is expected to be located 
is selected. This is done, in the present embodiment, by 
executing a method called cutAttributePage which takes 
as arguments the HTML file being processed and the 
rule field attribute range and then storing the resultant 
output in the intermediate processing field attrib- 
utePageFraction. 

[0029] If the rule field attribute range has the value 
"top" (see Wrapping rules 1 and 2) then, in the present 
embodiment, only the first 12000 characters of the body
of the HTML file are stored in the intermediate processing
field attributePageFraction and considered in steps
S130 to S150. If the field attribute range has the value
"middle" then only the characters between the 6000th
and the 18000th characters are considered in subsequent
steps. If the field attribute range has the value
"bottom" then only the characters after the 12000th character
are considered in subsequent steps. If the field attribute
range has the value "whole" (see Wrapping rules
3 to 6) or null (which is the default value - see Wrapping 
rules 7 and 8) then all characters in the HTML file are 
stored in the field attributePageFraction and considered 
in the subsequent steps. 
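
The character thresholds of paragraph [0029] translate directly into string slicing. The sketch below is illustrative: cut_attribute_page is a snake_case rendering of cutAttributePage, the body of the HTML file is assumed to be available as a string, and whether the boundary characters themselves are included is an assumption.

```python
def cut_attribute_page(body, attribute_range):
    """Return the part of the HTML body selected by the attribute range field."""
    if attribute_range == "top":
        return body[:12000]          # first 12000 characters
    if attribute_range == "middle":
        return body[6000:18000]      # between the 6000th and 18000th characters
    if attribute_range == "bottom":
        return body[12000:]          # everything after the 12000th character
    if attribute_range in ("whole", None):
        return body                  # null is the default and means "whole"
    raise ValueError("unknown attribute range: %r" % (attribute_range,))
```

A body shorter than a threshold is simply returned as far as it goes, since slicing past the end of a string is not an error.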

[0030] Note that the above described implementation 
is a rather simple implementation of the method cutAt- 
tributePage which is adequate for typical web-sites ad- 
vertising mobile telephones. However, it is clear that a 
more sophisticated implementation of the method could 
be used which first calculates, say, the number of lines 
of text in the web-site and then divides these into three 
equal parts and considers only those characters falling 
within the respective specified part, etc. 
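
The line-based variant suggested in paragraph [0030] might look like the sketch below. It is only one reading of that suggestion: it counts the lines of the supplied text (not rendered lines in a browser), the function name cut_by_lines is hypothetical, and the rounding of the two boundaries is an arbitrary choice.

```python
def cut_by_lines(text, attribute_range):
    """Select the top, middle or bottom third of the text, measured in lines."""
    if attribute_range in ("whole", None):
        return text
    lines = text.splitlines(keepends=True)
    first = len(lines) // 3            # end of the top third
    second = (2 * len(lines)) // 3     # end of the middle third
    parts = {
        "top": lines[:first],
        "middle": lines[first:second],
        "bottom": lines[second:],
    }
    if attribute_range not in parts:
        raise ValueError("unknown attribute range: %r" % (attribute_range,))
    return "".join(parts[attribute_range])
```

With fewer than three lines the top part is empty under this rounding, so a real implementation would need a rule for very short pages.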
[0031] Upon completion of step S120, flow passes to 
step S130 in which the approximate location of the data
to be extracted is identified. In the present embodiment, 
this is performed using the method locateAttribute which 
takes as arguments the intermediate processing field at- 
tributePageFraction and the rule fields attribute location 
and attribute content and storing the resultant output in 
the intermediate processing field attributeLocation. 
[0032] In the present implementation of the method
locateAttribute, the method behaves differently depending
upon which of the allowed values the rule field attribute
location takes (the allowed values in the present
implementation are "heuristic matching" (see Wrapping
rule 1), "data matching" (see Wrapping rule 2), "domain
matching" (see Wrapping rules 3 to 6) and null (see
Wrapping rules 7 and 8)). Where attribute location takes
the value "heuristic matching", the method expects the
field attribute content to take either the value "biggest
size" (see Wrapping rule 1) or "bold font". In the first case
it searches through all of the HTML tags indicating a special
font size and outputs the sequential position of the
character immediately following the HTML tag indicating the
largest font size in the data stored in the field
attributePageFraction. In the second case it searches
through the HTML tags until a tag is
found indicating that the following text is to be displayed
in a bold font and outputs the sequential position of the
character immediately following such a tag.
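
A minimal sketch of the "heuristic matching" case follows. It rests on assumptions that the text does not state: font size is taken to be signalled only by font tags with an absolute numeric size attribute, bold by b or strong tags, and tags are found with regular expressions rather than a full HTML parser. The name locate_heuristic is hypothetical.

```python
import re

# Assumed tag forms: <font size="6"> for size, <b> or <strong> for bold.
_FONT_TAG = re.compile(r"""<font\b[^>]*?\bsize\s*=\s*["']?(\d+)["']?[^>]*>""", re.I)
_BOLD_TAG = re.compile(r"<(?:b|strong)\b[^>]*>", re.I)


def locate_heuristic(fragment, attribute_content):
    """Return the index of the character just after the chosen tag, or None."""
    if attribute_content == "biggest size":
        best = None  # (size, position after tag) of the largest font seen so far
        for match in _FONT_TAG.finditer(fragment):
            size = int(match.group(1))
            if best is None or size > best[0]:
                best = (size, match.end())
        return None if best is None else best[1]
    if attribute_content == "bold font":
        match = _BOLD_TAG.search(fragment)  # first bold tag only
        return None if match is None else match.end()
    raise ValueError("unsupported attribute content: %r" % (attribute_content,))
```

When several tags share the largest size, this sketch keeps the first one; the text does not say how ties are resolved.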
[0033] Where attribute location takes the value "data
matching" the method attempts to find a match between
any of the words or phrases (ie single strings) stored in
the rule field attribute content and any part of the data
stored in the field attributePageFraction. Upon finding
such a match, the sequential position of the character
at the start of the matched word or phrase in the field
attributePageFraction is output and stored in the intermediate
processing field attributeLocation (as distinct
from the rule field attribute location).
[0034] Where the rule field attribute location takes the
value "domain matching" the method behaves in a similar
way to the "data matching" case described above.
Thus it considers the string or strings contained in the
rule field attribute content and attempts to match these
with the text contained in the intermediate processing
field attributePageFraction. As soon as a match is found,
the sequential position of the character at the start of
the matched word or phrase in the field attributePageFraction
is output and stored in the intermediate
processing field attributeLocation.
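
The "data matching" and "domain matching" cases both reduce to finding one of several strings in the page fraction. The sketch below makes two assumptions that the text leaves open: matching ignores case (the rules use lower-case strings such as "rrp"), and when several strings match, the earliest position in the text is reported. The name locate_by_matching is hypothetical.

```python
import re


def locate_by_matching(fragment, attribute_content):
    """Return the start index of the earliest matching string, or None."""
    positions = []
    for term in attribute_content:
        # re.escape treats the rule string literally, not as a pattern.
        match = re.search(re.escape(term), fragment, re.IGNORECASE)
        if match:
            positions.append(match.start())
    return min(positions) if positions else None
```
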
[0035] Upon completion of step S130, flow passes to
step S140 in which a portion of the text stored in attributePageFraction
is selected which is intended to include
the desired data to be extracted. In the present embodiment,
this is performed by using the method cutDataPage
which takes as arguments the intermediate
processing fields attributePageFraction and attributeLocation
and the rule field data range and then stores the
resultant output in the intermediate processing field
dataPageFraction. The selected text is that commencing
at the location specified in the intermediate processing
field attributeLocation and extending for as many words
as are specified in the rule field data range.
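
Step S140 can be sketched as a word-limited slice. The function name cut_data_page is a rendering of cutDataPage; splitting on whitespace and rejoining with single spaces is an assumption, and the sketch returns an empty string for a data range of 0, whereas the passage above does not spell out how the embodiment treats a range of 0 or null.

```python
def cut_data_page(attribute_page_fraction, attribute_location, data_range):
    """Return data_range words of text starting at attribute_location."""
    # Words are taken to be whitespace-separated runs of characters.
    words = attribute_page_fraction[attribute_location:].split()
    return " ".join(words[:data_range])
```

For Wrapping rule 4, for instance, a range of 3 starting at the keyword "only" keeps the keyword and the two words after it, which is enough to contain the price that follows.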

[0036] Upon completion of step S140, flow passes to
step S150 in which the data to be extracted is selected
from the text selected in step S140. In the present embodiment,
this is performed using the method extractData
which takes as arguments the intermediate
processing field dataPageFraction and the rule fields data
extraction method and data content and stores the
resultant extracted data in the intermediate processing
field attributeData.

[0037] In the present embodiment, the operation of
the method extractData varies in dependence upon the
content of the rule field data extraction method which,
in the present embodiment, can take on one of:
"matched", "direct", "indirect", "boolean" and "generate". In the
case where the rule field attribute location takes the value
"data matching", the rule field data extraction method
should take the value "matched" and, in this case, the
content of the intermediate processing field dataPageFraction
should already hold the desired data to be extracted
and therefore the method extractData simply returns
the contents of dataPageFraction for storing in
attributeData.

[0038] Where the rule field attribute location takes the
value "heuristic matching" or "domain matching" the rule
field data extraction method can take on any of "direct",
"indirect" or "boolean" (see Wrapping rules 1, 3, 4, 5 and
6). Where it takes the value "direct" (see Wrapping rule
1), the extractData method again simply returns the contents
of dataPageFraction. Where it takes the value "indirect"
(see Wrapping rules 3, 4 and 5) the extractData
method examines the contents of the rule field data content
and extracts the first occurrence of data of the specified
type of data content (eg "next currency" in Wrapping
rules 3, 4 and 5, which causes the extractData method
to return the first detected currency, designated by
a £ sign in the present embodiment, from the text stored
in dataPageFraction). Where it takes the value
"boolean" (see Wrapping rule 6) the method extractData()
returns a Boolean value (ie true or false) depending
on whether the string found in the rule field attribute content
(eg "in stock") is matched in the text contained within
the intermediate processing field dataPageFraction.
[0039] The final possibility for the rule field data extraction
method is that it contains the term "generate" in
which case, the content of the intermediate processing
field dataPageFraction (which should generally be null
when "generating" rather than "extracting") is ignored.
Instead, in this case, the contents of the rule field data
content are examined and where the contents are specific
values of the rule field attribute name (ie specific to
a particular wrapping rule), the extracted data associated
with the thus identified wrapping rules are selected
and combined and then stored in the intermediate
processing field attributeData (see Wrapping rule 7).
Where the contents are a single string, this single string
is simply returned and stored directly in the intermediate
processing field attributeData (see Wrapping rule 8).
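
The five behaviours of extractData described in paragraphs [0037] to [0039] can be gathered into one dispatch function. The sketch below is illustrative and makes several assumptions: a currency amount is recognised by a simple pattern starting with a £ sign, the "boolean" test ignores case, "generate" joins previously extracted values with an underscore, and the parameter extracted (a mapping from attribute name to already extracted data) is a hypothetical way of giving the function access to the results of other rules.

```python
import re

_CURRENCY = re.compile(r"£\s*\d+(?:[.,]\d+)*")  # assumed form of a price


def extract_data(data_page_fraction, method, data_content=(),
                 attribute_content=(), extracted=None):
    """Select the final data according to the data extraction method field."""
    if method in ("matched", "direct"):
        return data_page_fraction                 # text already is the data
    if method == "indirect":
        if "next currency" in data_content:
            match = _CURRENCY.search(data_page_fraction)
            return match.group(0) if match else None
        raise ValueError("unsupported data content: %r" % (data_content,))
    if method == "boolean":
        text = data_page_fraction.lower()
        return any(term.lower() in text for term in attribute_content)
    if method == "generate":
        known = extracted or {}
        if data_content and all(name in known for name in data_content):
            # Combine the data extracted by the named rules (Wrapping rule 7).
            return "_".join(str(known[name]) for name in data_content)
        if len(data_content) == 1:
            return data_content[0]                # single string (Wrapping rule 8)
        return None
    raise ValueError("unknown data extraction method: %r" % (method,))
```
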
[0040] Upon completion of step S150, flow passes to
step S160 in which the data extracted in the preceding
step is combined with the associated attribute name to
form a pair of data which forms the final output of the
method. In the present embodiment, this is performed
using a method called generatePair which takes as arguments
the rule field attribute name and the intermediate
processing field attributeData and combines the
contents of these fields to form a data pair which is
stored in the output field pair for use by external programs.

[0041] Upon completion of step S160, flow passes to
step S170 in which the field pair is output. In the present
case, this involves unlocking the output field pair so that 
external programs or threads may access the contents 
of this field. Note that the output field pair is, in this em- 
bodiment, one of an array of such output fields, such 
that there is one output field pair for each wrapping rule, 
and the particular member of the array can be accessed 
by using the wrapping rule number as an index or by 
referral to the content of the rule field attribute name for 
each particular wrapping rule. 
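
A minimal sketch of such an array of output pairs, addressable either by wrapping rule number or by attribute name, is given below. The class and method names are hypothetical, rule numbers are assumed to be 1-based, and the locking and unlocking of the field for external threads is deliberately left out.

```python
class PairStore:
    """Holds one output pair per wrapping rule, addressable two ways."""

    def __init__(self) -> None:
        self._pairs: list[tuple[str, object]] = []
        self._number_by_name: dict[str, int] = {}

    def put(self, attribute_name: str, attribute_data: object) -> int:
        """Store a pair and return its 1-based rule number."""
        self._pairs.append((attribute_name, attribute_data))
        self._number_by_name[attribute_name] = len(self._pairs)
        return len(self._pairs)

    def by_rule_number(self, number: int) -> tuple[str, object]:
        """Look a pair up using the wrapping rule number as an index."""
        if number < 1:
            raise IndexError("rule numbers start at 1")
        return self._pairs[number - 1]

    def by_attribute_name(self, name: str) -> tuple[str, object]:
        """Look a pair up using the content of the attribute name field."""
        return self.by_rule_number(self._number_by_name[name])


store = PairStore()
store.put("Brand_of_Mobilephone", "Nokia 3310")
store.put("Service_Provider", "btcellnet")
print(store.by_rule_number(2))
print(store.by_attribute_name("Brand_of_Mobilephone"))
```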

[0042] Upon completion of step S170, flow passes to
step S180 in which it is determined if there are more 
rules still to be processed for the currently selected 
HTML file. If so, flow passes back to step S120. Other- 
wise, the subroutine ends and flow passes to step S40 
(see Figure 2). 

[0043] Figure 4 shows a representation of an example 
HTML file forming a mobile phone product web page as 
viewed using a web browser, which is suitable for 
processing using the seven wrapping rules set out 
above, in combination with the processing method de- 
scribed above with reference to Figures 2 and 3. Con- 
sidering the HTML file displayed in Figure 4, consider, 
by way of example, the use of wrapping rule 1 above: 

Step S120: attributePageFraction <- cutAttributePage 
(Page, rule.attribute range). 

[0044] By substituting the parameter rule field attribute
range in the above statement with the value specified in
Extraction Rule 1, we have: attributePageFraction <-
cutAttributePage(Page, "top"). Here Page refers to the page
shown in Figure 4. The function cutAttributePage(Page,
rule.attribute range) returns the fraction of the Page
specified by rule.attribute range. In this case, the top
one-third of the original page is cut out and stored in
attributePageFraction.
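
As an illustration only, the cut could be sketched in Python as below. The sketch assumes the page is available as a list of rendered text lines in top-to-bottom order, and it assumes that "middle" and "bottom" denote the second and last thirds; the description itself only states that "top" is the top one-third.

```python
import math


def cut_attribute_page(page_lines: list[str], attribute_range: str) -> list[str]:
    """Return the part of the page named by attribute_range."""
    n = len(page_lines)
    first = math.ceil(n / 3)       # index just past the top third
    second = math.ceil(2 * n / 3)  # index just past the middle third
    if attribute_range == "whole":
        return list(page_lines)
    if attribute_range == "top":
        return page_lines[:first]
    if attribute_range == "middle":
        return page_lines[first:second]
    if attribute_range == "bottom":
        return page_lines[second:]
    raise ValueError(f"unsupported attribute range: {attribute_range!r}")


page = [f"line {i}" for i in range(1, 10)]  # nine rendered lines
print(cut_attribute_page(page, "top"))
```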

Step S130: attributeLocation <- locateAttribute(rule.
attribute location, rule.attribute content,
attributePageFraction).

[0045] By substituting the parameters in the above statement
with the values specified in wrapping rule 1, we have:
attributeLocation <- locateAttribute("heuristic matching",
"biggest size", attributePageFraction). Here the function
locateAttribute(rule.attribute location, rule.attribute
content, attributePageFraction) is used to locate the
attribute within attributePageFraction by using
rule.attribute location and rule.attribute content. In this
case, a heuristic matching rule, "biggest size", is used to
locate the attribute "Brand_of_Mobilephone". This rule is
executed by finding the text of the largest size in
attributePageFraction, which is "Nokia 3310". The starting
point of this text is the location of the attribute
"Brand_of_Mobilephone".

Step S140: dataPageFraction <- cutDataPage
(attributePageFraction, attributeLocation, rule.data
range).

[0046] By substituting the parameters in the above statement
with the values specified in wrapping rule 1, we have
dataPageFraction <- cutDataPage(attributePageFraction,
attributeLocation, 0). This acts to cut out of
attributePageFraction the minimum fraction of the page in
which the attribute and the data associated with it are
expected to occur. attributeLocation specifies the starting
point of this page fraction and rule.data range specifies the
length of this page fraction. Assigning 0 to rule.data range
means that the data will be extracted from the current
location of the attribute. At the end of this step,
therefore, dataPageFraction will be assigned "Nokia 3310".

Step S150: attributeData <- extractData 
(dataPageFraction, rule.data extraction method, rule. 
data content). 

[0047] By substituting the parameters in the above statement
with the values specified in wrapping rule 1, we have
attributeData <- extractData("Nokia 3310", "direct", null).
This acts to extract data from the fraction of the web page
specified by dataPageFraction, using the data extraction
method specified by rule.data extraction method and rule.data
content. In this case, "Nokia 3310" will be returned directly
as the data for the attribute.

Step S160: pair <- generatePair(rule.attribute name,
attributeData).

[0048] By substituting the parameters in the above statement
with the values specified in wrapping rule 1, we have
pair <- generatePair("Brand_of_Mobilephone", "Nokia 3310").
This acts to generate and return a pair of (rule.attribute
name, attributeData). In this case ("Brand_of_Mobilephone",
"Nokia 3310") is generated and assigned to pair.

[0049] The other wrapping rules are processed in a 
similar manner to give rise to the following output pairs: 

pair(1):- ("Brand_of_Mobilephone", "Nokia 3310");
pair(2):- ("Service_Provider", "btcellnet");
pair(3):- ("Original_Price", "£129.99");
pair(4):- ("Current_Price", "£107.99");
pair(5):- ("Comparable_Price", "£22.00");
pair(6):- ("Availability", true);
pair(7):- ("Sale_ID", "Nokia 3310_btcellnet").



[0050] Pair(7) and pairs(3-6) are then used to update 
the comparison table contained in the mobile telephone 
comparison web-site. 
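
The update can be pictured with the short Python sketch below. It assumes, purely for illustration, that the comparison table is a dictionary of rows and that the value of pair(7) serves as the row key; the description does not say how the table is stored, and the function name is hypothetical.

```python
def update_comparison_table(table: dict, pairs: list) -> None:
    """Insert or overwrite one row using pair(7) as key and pairs(3-6) as columns."""
    _, row_key = pairs[6]              # pair(7), zero-based index 6
    table[row_key] = dict(pairs[2:6])  # pairs (3) to (6)


pairs = [
    ("Brand_of_Mobilephone", "Nokia 3310"),
    ("Service_Provider", "btcellnet"),
    ("Original_Price", "£129.99"),
    ("Current_Price", "£107.99"),
    ("Comparable_Price", "£22.00"),
    ("Availability", True),
    ("Sale_ID", "Nokia 3310_btcellnet"),
]
table: dict = {}
update_comparison_table(table, pairs)
print(table)
```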



[0051] Referring now to Figure 5, the way in which wrapping
rules 1 to 7 are created is now described. An important
aspect of the present invention is that both the processing
methods and the wrapping rules input to those methods should
be constrained such that not all types of access methods for
locating data within an HTML or similar file are possible.
This is achieved in the present embodiment by providing a
user interface for creating wrapping rules which only permits
certain options to be followed by a user. This prevents the
user from accidentally specifying wrapping rules which may be
either incompatible with the method for processing the rules
or, importantly, less robust to minor modifications in the
content of the HTML or similar files to be processed than the
wrapping rules which the user interface permits.
[0052] In the present embodiment, the user interface proceeds
according to the method illustrated by the flow chart of
Figure 5. Thus, upon commencement of the method at step S200,
the user is prompted to enter the first rule field for the
rule, namely the attribute name (ie rule field <attribute
name>). The user has virtually complete freedom of choice
over what to enter at this stage. If, however, the user
enters the same name here as for an already existing rule,
the user may be prompted to enter a different name, to
provide a new package name or other higher level grouping for
overcoming the conflict, or to accept that the pre-existing
rule will be overwritten by the new one about to be created.
[0053] Upon completion of step S200, flow passes to step S210
in which the various choices which can be selected by the
user at this stage for the rule field <attribute location>
are presented to the user. The permitted choices are
"heuristic matching", "data matching", "domain matching" and
null. The flow of the method remains at step S210 until the
user selects one of these options (unless the user chooses to
quit the program, which can be done at any time and causes no
rule to be created).

[0054] Upon completion of step S210, flow passes to
step S300 in which it is determined if the user selected
"heuristic matching" in step S210. If so, flow passes to
step S310 (see Figure 5b).

[0055] In step S310 the various choices which can be selected
by the user at this stage for the rule field <attribute
content> are presented to the user. The permitted choices are
"biggest size" and "bold font". Of course, in a more
sophisticated embodiment more possible choices would be
presented to the user, such as colour, font type, etc. Once
the user has selected one of these options flow passes to
step S320.
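
To make the two heuristic choices concrete, the Python sketch below locates an attribute among a list of rendered text runs. The TextRun structure, the function name and the interpretation of "bold font" as "first run rendered in bold" are assumptions made for illustration; the description only states that "biggest size" selects the text of the largest size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextRun:
    text: str
    font_size: float
    bold: bool = False


def locate_by_heuristic(runs: list[TextRun], attribute_content: str) -> Optional[int]:
    """Return the index of the run selected by the heuristic, or None."""
    if attribute_content == "biggest size":
        if not runs:
            return None
        # max() keeps the first run when several share the largest size.
        return max(range(len(runs)), key=lambda i: runs[i].font_size)
    if attribute_content == "bold font":
        # Assumed meaning: the first run rendered in bold.
        return next((i for i, run in enumerate(runs) if run.bold), None)
    raise ValueError(f"unsupported heuristic: {attribute_content!r}")


runs = [
    TextRun("Special offer", 14, bold=True),
    TextRun("Nokia 3310", 24),
    TextRun("btcellnet", 12),
]
print(runs[locate_by_heuristic(runs, "biggest size")].text)
```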
[0056] At step S320, the various choices which can be
selected by the user at this stage for the rule field
<attribute range> are presented to the user. The permitted
choices are "top", "middle", "bottom" and "whole". Once the
user has selected one of these options flow passes to step
S330.

[0057] At step S330, the method determines that, in the
present embodiment, there is only one possible choice for the
next rule field <data extraction method> given that the rule
field <attribute location> was set in step S210 as "heuristic
matching". The user interface in the current implementation
therefore automatically sets the rule field <data extraction
method> to the only permitted choice of "direct". This
automatic selection is, however, displayed to the user to
keep the user informed of what is going on.
[0058] Upon completion of step S330, flow passes to
step S340 in which again there is only one permitted
choice for the rule field <data content> which is therefore
automatically set to null.

[0059] Upon completion of step S340, flow passes to 
step S350 in which again there is only one permitted 
choice for the rule field <data range> which is therefore 
automatically set to the sole permitted option of null. 
[0060] Upon completion of step S350, flow passes to 
step S360 in which the entire rule as thus created is dis- 
played to the user. Additionally, in the present embodi- 
ment, the thus created rule is saved in a previously spec- 
ified location as a Java class file. 
[0061] Upon completion of step S360, flow passes to 
step S370 in which the user interface requests the user 
to indicate whether there are any more rules which the 
user wishes to create in the current set of rules. If the 
user indicates that there are no further rules to be cre- 
ated in this set the method ends. Otherwise, flow passes 
back to step S200. 

[0062] If at step S300 it was determined that heuristic
matching had not been selected in step S210, then flow
passes to step S400 in which it is determined if "data
matching" was selected in step S210 and, if so, flow
passes to step S410 (see Figure 5c).
[0063] In step S410, the user is requested to input a desired
value for the rule field <attribute content>. Since data
matching was selected in the preceding step, this field holds
the word or words which will be searched for by the
locateAttribute method when executing the completed rule. The
user interface informs the user that this is the case and
continues to permit entry of further text strings until the
user has entered all possibilities.
[0064] Upon completion of step S410, flow passes to step S420
in which the various choices which can be selected by the
user at this stage for the rule field <attribute range> are
presented to the user. The permitted choices are "top",
"middle", "bottom" and "whole". Once the user has selected
one of these options flow passes to step S430.

[0065] At step S430, the method determines that, in the
present embodiment, there is only one possible choice for the
next rule field <data extraction method> given that the rule
field <attribute location> was set in step S210 as "data
matching". The user interface in the current implementation
therefore automatically sets the rule field <data extraction
method> to the only permitted choice of "matched".

[0066] Upon completion of step S430, flow passes to
step S440 in which again there is only one permitted
choice for the rule field <data content> which is therefore
automatically set to null.

[0067] Upon completion of step S440, flow passes to
step S450 in which again there is only one permitted
choice for the rule field <data range> which is therefore
automatically set to the sole permitted option of null.
[0068] Upon completion of step S450, flow passes to step S460
in which the entire rule as thus created is displayed to the
user. Additionally, in the present embodiment, the thus
created rule is saved in a previously specified location as a
Java class file.
[0069] Upon completion of step S460, flow passes to step S470
in which the user interface requests the user to indicate
whether there are any more rules which the user wishes to
create in the current set of rules. If the user indicates
that there are no further rules to be created in this set the
method ends. Otherwise, flow passes back to step S200.

[0070] If at step S400 it was determined that data
matching had not been selected in step S210, then flow
passes to step S500 in which it is determined if "domain
matching" was selected in step S210 and, if so, flow
passes to step S510 (see Figure 5d).
[0071] In step S510, the user is requested to input a desired
value for the rule field <attribute content>. Since domain
matching was selected in the preceding step, this field holds
the key-word or words which will be searched for by the
locateAttribute method when executing the completed rule in
order to locate the position of the desired data to within a
few words. The user interface informs the user that this is
the case and continues to permit entry of further text
strings until the user has entered all possibilities. If more
than one key-word is entered here, the locateAttribute method
will search for the first occurrence in the currently
selected HTML file of any one of the entered key-words.
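
The "first occurrence of any one of the key-words" behaviour can be sketched in Python as follows. The sketch works on plain text and ignores letter case; both are assumptions made for illustration, as is the function name.

```python
from typing import Optional


def locate_by_domain(text: str, keywords: list[str]) -> Optional[int]:
    """Return the offset of the earliest occurrence of any key-word, or None."""
    lowered = text.lower()
    positions = [lowered.find(keyword.lower()) for keyword in keywords]
    found = [position for position in positions if position != -1]
    return min(found) if found else None


text = "Was £129.99 Now £107.99"
position = locate_by_domain(text, ["our price", "now"])
print(text[position:])
```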
[0072] Upon completion of step S510, flow passes to step S520
in which the various choices which can be selected by the
user at this stage for the rule field <attribute range> are
presented to the user. The permitted choices are "top",
"middle", "bottom" and "whole". Once the user has selected
one of these options flow passes to step S530.

[0073] At step S530, the method determines that, in the
present embodiment, there are only two possible choices for
the next rule field <data extraction method> given that the
rule field <attribute location> was set in step S210 as
"domain matching". The user interface in the current
implementation therefore displays the two possible choices
available to the user, "indirect" and "Boolean", and awaits
selection of one of these options by the user.

[0074] Upon completion of step S530, flow passes to step S540
in which it is determined whether or not "indirect" was
selected by the user for the field <data extraction method>
in the preceding step. If "indirect" was not selected (and
therefore "Boolean" was clearly selected in its place) then
flow proceeds to step S550 in which, because there is only a
single possible allowed value for the rule field <data
content> in such a case, the single allowed value of null is
set. This setting is displayed to the user.

[0075] Upon completion of step S550, flow passes to
step S555 in which, because there is only a single possible
allowed value for the rule field <data range> in such
a case, the single allowed value of 0 is set. This setting
is again displayed to the user.

[0076] Upon completion of step S555, flow passes to 
step S570 discussed in greater detail below. 
[0077] If at step S540 it is determined that the user 
has selected "indirect" for the field <data extraction 
method> then flow passes to step S560 in which a 
number of permitted options are displayed for selection 
of one of the permitted options by the user. In the current 
implementation, the permitted options are "next curren- 
cy", "next level content", "next date" and "length of pe- 
riod". Of course, more sophisticated implementations 
could include many more such options. 
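
By way of illustration only, an option such as "next currency" might be realised by taking the first currency amount found in the cut-out text. The Python sketch below makes that assumption; the set of currency symbols, the number format and the function name are not taken from the description.

```python
import re
from typing import Optional

# Assumed format: a currency symbol, digits with optional thousands
# separators, and an optional two-digit fractional part.
CURRENCY_PATTERN = re.compile(r"[£$€]\s?\d+(?:,\d{3})*(?:\.\d{2})?")


def next_currency(fragment: str) -> Optional[str]:
    """Return the first currency amount in the fragment, or None."""
    match = CURRENCY_PATTERN.search(fragment)
    return match.group(0) if match else None


print(next_currency("Our price: £107.99 inc. VAT"))
```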
[0078] Upon completion of step S560, flow passes to step S565
in which the user is requested either to input a number or to
select the alternative permitted option of "Sub-tree". The
number indicates how many words beyond the start of the
domain location (as determined by the locateAttribute method
in the case where the rule field <attribute location> has the
value "domain matching") are to be selected by the
cutDataPage method to generate the data to be stored in the
intermediate processing field dataPageFraction. If "Sub-tree"
is selected, then the cutDataPage method will select all text
which is "dominated" by the attribute detected using the
locateAttribute function. To determine whether certain text
is dominated by the located attribute, the current
implementation considers not only HTML tags (in particular
relying on the normal tree structure implied by nested tags,
so that, for instance, all cells within a table are dominated
by the table title) but also heuristic rules. In the current
implementation, one such heuristic rule is that any text
which is not in bold and which follows bold text, without an
intervening new line (or similar) tag, is dominated by the
bold text.
[0079] Upon completion of step S565, flow passes to 
step S570 in which the entire rule as thus created is dis- 
played to the user. Additionally, in the present embodi- 
ment, the thus created rule is saved in a previously spec- 
ified location as a Java class file. 
[0080] Upon completion of step S570, flow passes to 
step S580 in which the user interface requests the user 
to indicate whether there are any more rules which the 
user wishes to create in the current set of rules. If the 
user indicates that there are no further rules to be cre- 
ated in this set the method ends. 
Otherwise, flow passes back to step S200. 
[0081] If at step S500 it was determined that domain
matching had not been selected in step S210, then flow
passes to step S600 in which it is determined if null was
selected in step S210 and, if so, flow passes to step
S610 (see Figure 5e).

[0082] In step S610, the method determines that, in the
present embodiment, there is only one possible choice for the
rule field <attribute content> given that the rule field
<attribute location> was set in step S210 as null, namely the
choice null. As such, this value is automatically set by the
user interface and then flow proceeds to step S620. Note that
the user interface displays to the user that this setting has
been made.

[0083] Again in step S620, there is only one possible
choice for the field <attribute range>, namely null, in
view of the selection of the value null at step S210.
[0084] Therefore this value is automatically set by the
user interface and then flow proceeds to step S630.
Note that the user interface again displays to the user
that this setting has been made.
[0085] At step S630, the method determines that, in the
present embodiment, there is only one possible choice for the
next rule field <data extraction method> given that the rule
field <attribute location> was set in step S210 as null. The
user interface in the current implementation therefore
automatically sets the rule field <data extraction method> to
the only permitted choice of "generate". Note that the user
interface again displays to the user that this setting has
been made.
[0086] Upon completion of step S630, flow passes to step S640
in which the user is requested to enter either two values of
the rule field <attribute name> in respect of two other
(earlier) rules or a single text string. In both cases, the
values entered are stored in the rule field <data content>.
When the method extractData detects that the rule field <data
extraction method> is set to "generate", it checks to see if
two strings are entered in the rule field <data content>.

[0087] If the method extractData determines that there are
two strings stored in the rule field <data content> when the
method is run, it checks that they correspond to rules which
have already been processed in respect of the current HTML
file and, if so, it extracts from the field pair associated
with each respective corresponding rule the data associated
with the field attributeData and combines the two thus
"extracted" sets of data. For example, in the case of
Wrapping Rule 7 set out above, the rule field <data content>
stores the values "Brand_of_Mobilephone" and
"Service_Provider". During execution of the method
extractData, it checks that these correspond to earlier rules
(in this case Wrapping Rules 1 and 2 respectively) and
determines from the pair fields associated with these rules
(after these have been applied to the current HTML file) the
corresponding attributeData values. In the case of the HTML
file illustrated in Figure 4, this gives rise to the
extractData method returning the value "Nokia 3310_btcellnet"
and storing this in the intermediate processing field
attributeData.

[0088] If the method extractData determines that there is
only a single string present in the rule field <data content>
then it simply returns this value directly. For example, in
Wrapping Rule 8 set out above, the value URL1 is returned by
the method extractData and stored in the intermediate
processing field attributeData.
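
The two "generate" cases can be sketched in Python as below. The sketch assumes the earlier results are available as a mapping from attribute name to attributeData, and it joins the two values with an underscore because that is what the example value "Nokia 3310_btcellnet" suggests; the function name and error handling are illustrative.

```python
def extract_generate(data_content: list[str], earlier_data: dict[str, str]) -> str:
    """Return a literal string, or combine the data of two earlier rules."""
    if len(data_content) == 1:
        # A single string is returned directly.
        return data_content[0]
    if len(data_content) == 2:
        missing = [name for name in data_content if name not in earlier_data]
        if missing:
            raise KeyError(f"rules not yet processed: {missing}")
        # Underscore separator inferred from the worked example.
        return "_".join(earlier_data[name] for name in data_content)
    raise ValueError("data content must hold one string or two attribute names")


earlier = {"Brand_of_Mobilephone": "Nokia 3310", "Service_Provider": "btcellnet"}
print(extract_generate(["Brand_of_Mobilephone", "Service_Provider"], earlier))
print(extract_generate(["URL1"], earlier))
```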
[0089] Upon completion of step S640, flow passes to 
step S650 in which again there is only one permitted 
choice for the rule field <data range> which is therefore 
automatically set to the sole permitted option of null. 
[0090] Upon completion of step S650, flow passes to 
step S660 in which the entire rule as thus created is dis- 
played to the user. Additionally, in the present embodi- 
ment, the thus created rule is saved in a previously spec- 
ified location as a Java class file. 
[0091] Upon completion of step S660, flow passes to 
step S670 in which the user interface requests the user 
to indicate whether there are any more rules which the 
user wishes to create in the current set of rules. If the 
user indicates that there are no further rules to be cre- 
ated in this set the method ends. 
Otherwise, flow passes back to step S200. 
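
The constraints enforced by the user interface of Figure 5 can be summarised as a small validity check. The Python sketch below encodes the permitted combinations as they are stated in the walkthrough above; representing a rule as a dictionary keyed by the rule field names, and the function name is_permitted, are assumptions made for illustration.

```python
RANGES = {"top", "middle", "bottom", "whole"}
HEURISTICS = {"biggest size", "bold font"}
INDIRECT_CONTENT = {"next currency", "next level content", "next date", "length of period"}


def is_permitted(rule: dict) -> bool:
    """Check a rule against the choices the Figure 5 interface allows."""
    location = rule["attribute location"]
    content = rule["attribute content"]
    method = rule["data extraction method"]
    data_content = rule["data content"]
    data_range = rule["data range"]

    if location is None:  # "generate" branch (steps S610 to S650)
        return (content is None and rule["attribute range"] is None
                and method == "generate"
                and isinstance(data_content, list) and len(data_content) in (1, 2)
                and data_range is None)
    if rule["attribute range"] not in RANGES:
        return False
    if location == "heuristic matching":
        return (content in HEURISTICS and method == "direct"
                and data_content is None and data_range is None)
    if location == "data matching":
        return (bool(content) and method == "matched"
                and data_content is None and data_range is None)
    if location == "domain matching":
        if not content:
            return False
        if method == "Boolean":
            return data_content is None and data_range == 0
        if method == "indirect":
            range_ok = data_range == "sub-tree" or (
                isinstance(data_range, int) and not isinstance(data_range, bool)
                and data_range >= 0)
            return (isinstance(data_content, str)
                    and data_content.lower() in INDIRECT_CONTENT and range_ok)
    return False


rule_java = {
    "attribute name": "Java", "attribute location": "domain matching",
    "attribute content": "Java", "attribute range": "whole",
    "data extraction method": "indirect", "data content": "length of period",
    "data range": 6,
}
print(is_permitted(rule_java))
```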

Description of the Second Embodiment 

[0092] A second embodiment of the present invention 
will now be described with reference to Figures 6 and 7. 
Figure 6 is a screen dump of an extract from a web page 
of an example curriculum vitae. By using a different set 
of wrapping rules (Wrapping rules 9 to 11 set out below) 
the wrapping program described above with reference 
to Figures 2 and 3 can be used to extract information 
from web-pages such as that illustrated in Figure 6. 
[0093] Figure 7 is a block diagram of the second em- 
bodiment in which a computer 110, which is connected 
to the Internet 130 via a data connection 111, runs the 
wrapping program described above with reference to 
Figures 2 and 3 using Wrapping rules 9 to 11 (set out 
below) to process HTML files containing CVs of poten- 
tial applicants. The HTML files of CVs may be sent, for 
example, by a recruitment company. The wrapping pro- 
gram processes the HTML files to attempt to extract the 
information of interest to the user of computer 110, and 
alerts the user if certain conditions are met so that the 
user may read the entire CV if it is likely to be of interest. 
[0094] The Wrapping rules used in this embodiment 
are: 

Extraction Rule 9:

attribute name: "Full_Name"
attribute location: "domain matching"
attribute content: "Full name"
attribute range: "whole"
data extraction method: "indirect"
data content: "next level content"
data range: sub-tree

Extraction Rule 10:

attribute name: "Date_of_Birth"
attribute location: "domain matching"
attribute content: "Date of Birth"
attribute range: "whole"
data extraction method: "indirect"
data content: "Next Date"
data range: sub-tree

Extraction Rule 11:

attribute name: "Java"
attribute location: "domain matching"
attribute content: "Java"
attribute range: "whole"
data extraction method: "indirect"
data content: "length of period"
data range: 6

[0095] When Wrapping rule 9 is employed by the wrapping
program to process the HTML file illustrated in Figure 6, the
following steps occur:

Step 1: attributePageFraction <- cutAttributePage
(Page, Rule."attribute range").

[0096] By substituting the parameter in the above statement
with the value specified in Wrapping rule 9, we have:
attributePageFraction <- cutAttributePage(Page, "whole").
Here Page refers to the page shown in Figure 6. The function
cutAttributePage(Page, Rule."attribute range") returns the
fraction of the Page specified by Rule."attribute range". In
this case, the whole original page is cut out and stored in
attributePageFraction.

Step 2: attributeLocation <- locateAttribute
(Rule."attribute location", Rule."attribute content",
attributePageFraction).

[0097] By substituting the parameters in the above statement
with the values specified in Wrapping rule 9, we have:
attributeLocation <- locateAttribute("domain matching",
"full name", attributePageFraction). Here the function
locateAttribute(Rule."attribute location", Rule."attribute
content", attributePageFraction) is used to locate the
attribute within attributePageFraction by using
Rule."attribute location" and Rule."attribute content". In
this case, a domain matching rule, "full name", is used to
locate the attribute "Full_Name". This rule is executed by
searching for "Full name" in attributePageFraction. Once
"Full name" is found, the attribute has been located.

Step 3: dataPageFraction <- cutDataPage
(attributePageFraction, attributeLocation, Rule."data
range").

[0098] By substituting the parameter Rule."data range" in the
above statement with the value specified in Wrapping rule 9,
we have dataPageFraction <- cutDataPage(attributePageFraction,
attributeLocation, sub-tree). The function
cutDataPage(attributePageFraction, attributeLocation,
Rule."data range") is used to cut out of attributePageFraction
the minimum fraction of the page in which the attribute and
the data associated with it must occur. attributeLocation
specifies the starting point of this page fraction and
Rule."data range" specifies the length of this page fraction.
[0099] We note that we generate a tree structure out of the
content of the web page. Besides the normal tree structure
implied by the nested tags (for instance, each cell of a
table is dominated by the table), we also expand the tree
structure by using some heuristic rules. One of the rules is:

IF we detect "Text1 Text2", where Text1 is in bold
AND Text2 is in ordinary format
THEN we have "Text1 dominates Text2".

[0100] For instance, if we have "Full name John Smith", then
"Full name" dominates "John Smith".
[0101] Assigning "sub-tree" to Rule."data range" means that a
sub-tree starting from the location of the current attribute
will be cut out. At the end of this step, dataPageFraction
will be assigned the content of this sub-tree.
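
The bold-text heuristic can be sketched in Python as follows. The sketch assumes the page has already been reduced to a flat list of text runs, each flagged as bold or not and as starting a new line or not; that representation and the names used are illustrative and are not taken from the description.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    bold: bool
    starts_new_line: bool = False


def dominated_text(runs: list[Run], start: int) -> list[str]:
    """Collect the non-bold runs that follow the bold run at `start` on the same line."""
    if not runs[start].bold:
        return []
    dominated = []
    for run in runs[start + 1:]:
        if run.bold or run.starts_new_line:
            break  # a new bold run or a line break ends the dominated span
        dominated.append(run.text)
    return dominated


runs = [
    Run("Full name", bold=True, starts_new_line=True),
    Run("John Smith", bold=False),
    Run("Date of Birth", bold=True, starts_new_line=True),
    Run("27/11/69", bold=False),
]
print(dominated_text(runs, 0))
```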

Step 4: attributeData <- extractData(dataPageFraction,
Rule."data extraction method", Rule."data content").

[0102] By substituting the parameters in the above statement
with the values specified in Wrapping rule 9, we have
attributeData <- extractData("Full name: John Smith",
"indirect", "next level content"). The function
extractData(dataPageFraction, Rule."data extraction method",
Rule."data content") extracts data from the fraction of the
web page specified by dataPageFraction, using the data
extraction method specified by Rule."data extraction method"
and Rule."data content". In this case, "John Smith" will be
returned indirectly as the data for the attribute
"Full_Name", since it is the next level content in the
expanded tree structure.

Step 5: pair <- generatePair(Rule."attribute name",
attributeData).

[0103] By substituting the parameters in the above statement
with the value specified in Wrapping rule 9 and the value
previously calculated, we have pair <-
generatePair("Full_Name", "John Smith"). The function
generatePair(Rule."attribute name", attributeData) generates
and returns a pair of (Rule."attribute name", attributeData).
In this case ("Full_Name", "John Smith") is generated and
assigned to pair.
[0104] When Wrapping rule 10 is employed by the wrapping
program to process the HTML file illustrated in Figure 6, the
procedure is similar to that described above with respect to
Wrapping rule 9 except that at step 3 the sub-tree which is
extracted and stored in the intermediate processing field
dataPageFraction is "Date of Birth: 27/11/69".

[0105] At Step 4, the method of "Next Date" is used
to convert "27/11/69" to the format required by the
system, say "27-11-69".
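
A possible reading of this conversion is sketched below in Python: the first date of the form day/month/year is located in the cut-out text and re-emitted with hyphens. The accepted input format, the regular expression and the function name are assumptions; the description only gives the single example above.

```python
import re
from typing import Optional

# Assumed input format: d/m/yy or d/m/yyyy with one- or two-digit day and month.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})\b")


def next_date(data_page_fraction: str) -> Optional[str]:
    """Return the first date found, rewritten with hyphens, or None."""
    match = DATE_PATTERN.search(data_page_fraction)
    if match is None:
        return None
    day, month, year = match.groups()
    return f"{day}-{month}-{year}"


print(next_date("Date of Birth: 27/11/69"))
```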

[0106] At Step 5, ("Date_of_Birth", "27-11-69") is returned
as a pair.
[0107] When Wrapping rule 11 is employed by the
wrapping program to process the HTML file illustrated
in Figure 6, the following steps take place:

Step 1: attributePageFraction <- cutAttributePage(Page,
Rule."attribute range").

=> attributePageFraction <- cutAttributePage(Page, "whole").

By specifying "whole" in the function cutAttributePage, the
returned page is still the same as the Page that is shown in
Figure 6.

Step 2: attributeLocation <- locateAttribute
(Rule."attribute location", Rule."attribute content",
attributePageFraction).

=> attributeLocation <- locateAttribute("domain matching",
"java", attributePageFraction).
By specifying "domain matching" in the function
locateAttribute(), the engine will search in the page
attributePageFraction for the name "java". Once "java" is
found, its location will be assigned to attributeLocation.

Step 3: dataPageFraction <- cutDataPage
(attributePageFraction, attributeLocation, Rule."data
range").

=> dataPageFraction <- cutDataPage(attributePageFraction,
"java", 6).
By specifying 6 as data range, we will get six words starting
from "java". The resulting dataPageFraction is
"Java (2 years), applet and application".

Step 4: attributeData <- extractData(dataPageFraction,
Rule."data extraction method", Rule."data content").

=> attributeData <- extractData("Java (2 years), applet and
application", "indirect", "length of period").
By specifying "length of period" as the data content, "2
years" is extracted from "Java (2 years), applet and
application" as the first period span.
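
Steps 3 and 4 for this rule can be sketched together in Python. The sketch treats a "word" as a whitespace-separated token, covers only the numeric word-count form of the data range, and recognises periods with a regular expression over a few assumed units; the sample CV text and all names are illustrative.

```python
import re
from typing import Optional

# Assumed units; the description only shows "2 years".
PERIOD_PATTERN = re.compile(r"\d+\s*(?:year|month|week|day)s?\b", re.IGNORECASE)


def cut_words(text: str, keyword: str, word_count: int) -> str:
    """Return word_count whitespace-separated words starting at the key-word."""
    start = text.lower().find(keyword.lower())
    if start == -1:
        return ""
    return " ".join(text[start:].split()[:word_count])


def length_of_period(fragment: str) -> Optional[str]:
    """Return the first period span in the fragment, or None."""
    match = PERIOD_PATTERN.search(fragment)
    return match.group(0) if match else None


cv_text = "Skills: Java (2 years), applet and application. C++ (1 year)."
fragment = cut_words(cv_text, "java", 6)
print(fragment)
print(length_of_period(fragment))
```

Because the cut is limited to six words, the later "1 year" entry in the sample text never reaches the period extraction.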

Step 5: pair <- generatePair(Rule."attribute name",
attributeData).

=> pair <- generatePair("Java", "2 years").
("Java", "2 years") will be returned and assigned to
pair.

Discussion of alternatives

[0108] In place of processing HTML files, the wrapping
program could be adapted to process any number of other,
largely text-based, file formats. For example, simple text
files, formatted text files, rich-text files, and
word-processing files such as Microsoft WORD documents could
all be processed using similar wrapping program techniques.



Claims 

1. A method of extracting information from a specified
ordered group of data containing semi-structured
data, the method comprising the steps of:

extracting data from within the specified ordered
group of data from a location as specified
by a plurality of rules; wherein
the rules specify the location, within the speci- 
fied ordered group of data, of data to be extract- 
ed, in terms of visual characteristics of the data 
to be extracted as they appear when the spec- 
ified ordered group of data is displayed using a 
computer application capable of displaying the 
semi-structured data in a manner determined 
by formatting data included in the specified or- 
dered group of data. 

2. A method as claimed in Claim 1 wherein the ordered 
group of data is a data file storing HTML content. 

3. A method as claimed in Claim 2 wherein the rules
specify the location of the data to be extracted in 
terms of visual characteristics of the data as they 
appear when the data file is displayed using a web 
browser, and the formatting data included in the da- 
ta file is in the form of HTML tags. 

4. A method as claimed in claim 3 wherein the rules 
are formed from a combination of one or more spec- 
ifications of the following visual characteristics of 
the data to be extracted as they appear, relative to 
the appearance of the specified data file as a whole, 
when the specified file is displayed within the brows- 
er: the fractional area within the specified file, as it 
appears within the displaying computer application, 
of the data; the relative size of the font of the data 
if it is in the form of text; the colour of the data; the 
proximity of the data to a specified keyword; and the 
location of the data relative to a specified keyword. 

5. A method of generating a set of rules for extracting 
one or more sets of data from a specified ordered 
group of data containing semi-structured data, the
method comprising the steps of: 



controlling the display of a list of names of vis- 
ual characteristics suitable for describing the 
location of a set of data to be extracted from a 
specified ordered group of data containing 
semi-structured data; and 
generating one or more rules involving a com- 
bination of one or more of the displayed char- 
acteristics together with one or more specified 
values of the characteristics, in response to input
signals generated by a user.

6. A method as claimed in claim 5 further comprising 
the step of converting the rules into computer im- 
plementable instructions for processing an input file 
containing HTML content and outputting the data
which would be displayed at the location specified 
by the rules if the input file were displayed in a 
browser for displaying HTML content. 

7. A method as claimed in claim 6 wherein the computer
implementable instructions are in the form of
a Java class file. 

8. Computer processing means for carrying out the 
method of any one of claims 1 to 4, including reading
means for reading the specified file, a memory
and writing means for writing extracted data to the 
memory. 

9. A device for generating a set of rules for extracting
one or more sets of data from a specified ordered 
group of data containing semi-structured data com- 
prising: 

display control means for controlling the display
of a list of names of visual characteristics suitable
for describing the location of a set of data
to be extracted from a specified ordered group
of data containing semi-structured data; and
rule generation means for generating one or
more rules comprising a combination of one or
more of the displayed characteristics together
with one or more specified values of the characteristics,
wherein the specified values of the
characteristics are set in response to input signals
generated by a user via a user interface.

10. A device as claimed in claim 9 further including conversion
means for converting the generated rules
into computer implementable instructions for
processing an input file containing HTML content
and outputting the data which would be displayed
at the location specified by the rules if the input file
were displayed in a browser for displaying HTML
content.
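
Purely as an illustrative sketch, a rule of the kind recited in claims 4, 5 and 9 might be represented in Python as a combination of visual characteristics together with user-specified values. The class names RenderedFragment and VisualRule, their fields, the use of a vertical band to stand for the fractional area, and the reduction of keyword proximity to a simple containment test are all hypothetical simplifications. A practical implementation would need layout information from a browser or other rendering engine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RenderedFragment:
    """A piece of text as it might appear once the page is displayed (hypothetical)."""
    text: str
    top: float         # vertical position as a fraction of page height, 0.0 to 1.0
    font_scale: float  # font size relative to the base font of the page
    colour: str


@dataclass
class VisualRule:
    """A combination of visual characteristics and specified values (hypothetical)."""
    attribute: str
    area: Optional[Tuple[float, float]] = None  # (top_min, top_max) band of the page
    min_font_scale: Optional[float] = None
    colour: Optional[str] = None
    keyword: Optional[str] = None

    def matches(self, fragment: RenderedFragment) -> bool:
        # Each characteristic left as None is simply not tested.
        if self.area is not None:
            top_min, top_max = self.area
            if not (top_min <= fragment.top <= top_max):
                return False
        if self.min_font_scale is not None and fragment.font_scale < self.min_font_scale:
            return False
        if self.colour is not None and fragment.colour.lower() != self.colour.lower():
            return False
        if self.keyword is not None and self.keyword.lower() not in fragment.text.lower():
            return False
        return True


def extract(fragments: List[RenderedFragment], rule: VisualRule) -> List[str]:
    """Return the text of every fragment satisfying all characteristics of the rule."""
    return [fragment.text for fragment in fragments if rule.matches(fragment)]
```

Because such a rule refers only to how the data appears when displayed, a change to the underlying HTML tags that leaves the rendered position, font size and colour unchanged would not alter which fragments are selected in this sketch.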